Introduction to Medical Statistics 2024
Exercises Class VIII
Logistic regression
Exercise 1: Univariable logistic regression
This exercise uses the dataset cmTbmData.csv, which contains information on 201 patients with meningitis from 4 different patient groups. We restrict attention to HIV-positive patients and explore how the CSF white cell count affects the probability of having TBM rather than CM. You need to load the packages ggplot2, gtsummary and ggeffects.
- Import the dataset (select “stringsAsFactors”) and create a new dataset which contains only HIV-positive patients. Because csfwcc has a rather skewed distribution, create a new variable log2.csfwcc in the dataset which contains the log2-transformed values (more precisely, use log2(csfwcc+1) to deal with counts of zero).
Create a boxplot of log2.csfwcc by diagnosis (CM and TBM) to get a first visual impression of the data. Add the individual measurements to the boxplot.
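The data preparation and boxplot described above could be sketched as follows. This is only an illustration on simulated data, not the course solution: the column names hiv and diagnosis, the level "positive" and the simulated values are assumptions, so check names(d) and levels(d$hiv) in the real file. The plot is drawn only if ggplot2 is installed.

```r
# Simulated stand-in for cmTbmData.csv (values are invented for illustration).
# ASSUMPTION: the real file has columns named `hiv` and `diagnosis`.
set.seed(1)
n <- 201
d <- data.frame(
  hiv       = factor(sample(c("negative", "positive"), n, replace = TRUE)),
  diagnosis = factor(sample(c("CM", "TBM"), n, replace = TRUE)),
  csfwcc    = rnbinom(n, size = 0.5, mu = 150)  # skewed counts, zeros possible
)
# With the real file one would instead use:
# d <- read.csv("cmTbmData.csv", stringsAsFactors = TRUE)

# Keep HIV-positive patients only and add the log2-transformed count.
d.hiv <- subset(d, hiv == "positive")
d.hiv$log2.csfwcc <- log2(d.hiv$csfwcc + 1)  # +1 so that a count of 0 maps to 0

# Boxplot by diagnosis with the individual measurements overlaid.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(d.hiv, aes(x = diagnosis, y = log2.csfwcc)) +
    geom_boxplot(outlier.shape = NA) +                  # outliers shown by the jitter layer
    geom_jitter(width = 0.15, height = 0, alpha = 0.6)  # jitter horizontally only
  print(p)
}
```

Setting outlier.shape = NA avoids drawing outlying points twice, once by the boxplot and once by the jitter layer.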
- How does log2.csfwcc affect the probability of having TBM compared to CM? Perform a univariable logistic regression, summarize the model fit and interpret the resulting odds ratio.
The logistic regression model as implemented in the glm function requires the outcome to be either a variable of the R type factor or a variable with values 0 and 1. If you did not specify stringsAsFactors = TRUE, the outcome is a character variable and you will get an error message. It may not be obvious which diagnosis is interpreted as “0” (the reference value, “no event”) and which as “1” (the “event”). By default, this is determined by the alphabetical order of the factor levels: the first level acts as the reference (“no event”) and the second level is the “event”. Hence, CM is the reference, and we model the probability of having TBM. Another approach is to create a variable that is 0 for CM patients and 1 for TBM patients and to use this as the outcome. Try both approaches.
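The two ways of coding the outcome can be compared in a small sketch. The data are simulated with arbitrary coefficients, and the names diagnosis and tbm are assumptions chosen for illustration; only the mechanics carry over to the real dataset.

```r
# Simulated HIV-positive subset (invented values; -3 and 0.6 are arbitrary).
set.seed(2)
n <- 120
log2.csfwcc <- runif(n, 0, 10)
p.tbm <- plogis(-3 + 0.6 * log2.csfwcc)
d.hiv <- data.frame(
  log2.csfwcc = log2.csfwcc,
  diagnosis   = factor(ifelse(rbinom(n, 1, p.tbm) == 1, "TBM", "CM"))
)

# Approach 1: factor outcome. The first level is the reference ("no event").
levels(d.hiv$diagnosis)  # "CM" comes first alphabetically, so TBM is the event
fit.factor <- glm(diagnosis ~ log2.csfwcc, family = binomial, data = d.hiv)

# Approach 2: explicit 0/1 outcome (0 = CM, 1 = TBM).
d.hiv$tbm <- as.integer(d.hiv$diagnosis == "TBM")
fit.01 <- glm(tbm ~ log2.csfwcc, family = binomial, data = d.hiv)

# Both codings describe the same model, so the coefficients agree.
all.equal(coef(fit.factor), coef(fit.01))

# Odds ratios with Wald 95% confidence intervals. The OR for log2.csfwcc
# is the multiplicative change in the odds of TBM per doubling of (csfwcc + 1).
exp(cbind(OR = coef(fit.factor), confint.default(fit.factor)))
```

If the reference level needs to be changed, relevel(d.hiv$diagnosis, ref = "TBM") would make CM the event instead.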
- Based on the univariable model from b), what is the predicted probability that a subject with log2.csfwcc=6 has TBM? Calculate the answer in 3 ways:
- “By hand” based on the regression coefficients and the logistic regression model (slide 16).
- Using the predict function
- With the help of the ggpredict function from the ggeffects package, which automatically adds confidence intervals
Make a figure that shows the predicted probability of having TBM across all values of csfwcc. To do so, apply the plot function to the output of the ggpredict function.
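The three ways of obtaining a predicted probability can be sketched as below, again on simulated data with arbitrary coefficients. The hand calculation uses p = exp(b0 + b1*x) / (1 + exp(b0 + b1*x)); the variable name tbm for a 0/1 outcome is an assumption. The ggpredict part runs only if ggeffects and ggplot2 are installed.

```r
# Simulated data (invented values) and a univariable logistic regression.
set.seed(3)
n <- 120
d.hiv <- data.frame(log2.csfwcc = runif(n, 0, 10))
d.hiv$tbm <- rbinom(n, 1, plogis(-3 + 0.6 * d.hiv$log2.csfwcc))
fit <- glm(tbm ~ log2.csfwcc, family = binomial, data = d.hiv)

# 1) By hand: linear predictor at log2.csfwcc = 6, then inverse logit.
b  <- coef(fit)
lp <- b[["(Intercept)"]] + b[["log2.csfwcc"]] * 6
p.hand <- exp(lp) / (1 + exp(lp))

# 2) predict(): type = "response" returns a probability, not a log-odds.
p.pred <- predict(fit, newdata = data.frame(log2.csfwcc = 6), type = "response")

c(by.hand = p.hand, predict = unname(p.pred))  # the two values agree

# 3) ggpredict(): the same prediction with a confidence interval, and a
#    plot of the predicted probability over the observed range of values.
if (requireNamespace("ggeffects", quietly = TRUE) &&
    requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggeffects)
  print(ggpredict(fit, terms = "log2.csfwcc [6]"))
  print(plot(ggpredict(fit, terms = "log2.csfwcc [all]")))
}
```

Leaving out type = "response" in predict returns the linear predictor (log-odds), which is a common source of confusion when comparing with the hand calculation.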
Exercise 2: Multivariable logistic regression
Prediction of re-shock in patients with DSS (dengue shock syndrome)
The dataset DF.csv contains data from 2007 children with DSS who were recruited into the DF study between 2003 and 2009. For this exercise, we aim to predict the occurrence of re-shock based on the 268 subjects recruited into the DF study in 2009.
- Import the DF.csv dataset and create a new dataset df2009 which contains only the 268 subjects recruited in 2009 and the variables:
- Outcome Y: re-shock (reshock)
- Covariables X (measured at onset of shock): Age (age), sex, day of illness at shock (day_ill), temperature (temp), platelet count (plt), hematocrit (hct).
Look at descriptive statistics for the outcome and the covariables using the summary function. How many re-shocks occur in the 268 subjects? Do any of the covariables have missing values? Do you notice anything peculiar about the temperature values?
Additionally, make a summary of the covariables by outcome value using the tbl_summary function from the gtsummary package. Would you recommend a log-transformation of the platelet count or hematocrit?
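A possible outline of the subsetting and descriptive steps is sketched below on simulated data. The recruitment-year column name year, the coding of reshock and sex, and all simulated values are assumptions made for illustration; the real DF.csv may differ. The gtsummary table is built only if that package is installed.

```r
# Simulated stand-in for DF.csv (invented values).
# ASSUMPTION: recruitment year is stored in a column called `year`.
set.seed(4)
n <- 2007
DF <- data.frame(
  year    = sample(2003:2009, n, replace = TRUE),
  reshock = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.7, 0.3))),
  age     = sample(1:15, n, replace = TRUE),
  sex     = factor(sample(c("female", "male"), n, replace = TRUE)),
  day_ill = sample(3:7, n, replace = TRUE),
  temp    = round(rnorm(n, 37.5, 0.6), 1),
  plt     = round(rlnorm(n, log(40000), 0.6)),
  hct     = round(rnorm(n, 48, 5), 1),
  other   = rnorm(n)  # placeholder for columns not needed here
)
# With the real file: DF <- read.csv("DF.csv", stringsAsFactors = TRUE)

# Keep subjects recruited in 2009 and only the outcome and covariables.
vars   <- c("reshock", "age", "sex", "day_ill", "temp", "plt", "hct")
df2009 <- DF[DF$year == 2009, vars]

summary(df2009)                         # ranges, quartiles, NA counts per variable
table(df2009$reshock, useNA = "ifany")  # number of re-shocks
colSums(is.na(df2009))                  # missing values per variable

# Covariables summarised by outcome value.
if (requireNamespace("gtsummary", quietly = TRUE)) {
  library(gtsummary)
  tab <- tbl_summary(df2009, by = reshock)
  # print(tab) displays the table
}
```

A histogram of plt and hct on the original and on the log scale is a simple additional check when deciding about a log-transformation.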
- Perform a univariable logistic regression of the outcome on each covariable separately and interpret the results. Which covariables clearly show an association with the occurrence of re-shock?
Note the results for plt on the original scale. What is happening here? Do you have a suggestion for how to change this?
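One thing worth checking for a covariable measured in large units is the scale on which its odds ratio is reported. The sketch below uses simulated data and assumes, purely for illustration, that plt takes values in the tens of thousands; the rescaled variable names plt10k and log2.plt are hypothetical. It shows one possible approach, not necessarily the intended answer.

```r
# Simulated data (invented values; the true effect is arbitrary).
set.seed(5)
n <- 268
df2009 <- data.frame(plt = round(rlnorm(n, log(40000), 0.6)))
df2009$reshock <- rbinom(n, 1, plogis(0.5 - 0.00003 * df2009$plt))

# Original scale: the OR refers to an increase of ONE unit of plt,
# which is a tiny change, so the estimate is extremely close to 1.
fit.raw <- glm(reshock ~ plt, family = binomial, data = df2009)
exp(coef(fit.raw))["plt"]

# Rescaled: OR per 10,000 units. Same model fit, more readable effect size.
df2009$plt10k <- df2009$plt / 10000
fit.10k <- glm(reshock ~ plt10k, family = binomial, data = df2009)
exp(coef(fit.10k))["plt10k"]

# Log2 scale: OR per doubling of plt. This is a different model,
# since it assumes linearity in log2(plt) rather than in plt.
df2009$log2.plt <- log2(df2009$plt)
fit.log2 <- glm(reshock ~ log2.plt, family = binomial, data = df2009)
exp(coef(fit.log2))["log2.plt"]
```

Dividing by a constant changes only the unit of the odds ratio, not the p-value or the fitted probabilities, whereas the log2 transformation changes the assumed shape of the relationship.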
- Perform a multivariable logistic regression with all covariables jointly and interpret the results. Which covariables are significant after adjustment for all others and what is their effect size?
- Based on the multivariable model in c), age and day of illness are two important predictors of re-shock. Is there any evidence that age or the day of illness at shock affects the outcome non-linearly? Evaluate this by testing for a quadratic effect of each variable in turn. Hint: compare the models with and without the quadratic term using the anova function.
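The model comparison suggested in the hint might look like the following sketch. It uses simulated data with arbitrary linear effects and, to stay short, includes only age and day_ill rather than all covariables; in the exercise the quadratic term would be added to the full multivariable model.

```r
# Simulated data (invented values; effects are linear by construction).
set.seed(6)
n <- 268
df2009 <- data.frame(
  age     = sample(1:15, n, replace = TRUE),
  day_ill = sample(3:7, n, replace = TRUE)
)
df2009$reshock <- rbinom(n, 1, plogis(1 - 0.1 * df2009$age - 0.2 * df2009$day_ill))

# Model with linear effects only.
fit.lin <- glm(reshock ~ age + day_ill, family = binomial, data = df2009)

# Add a quadratic term for one variable at a time.
# I() is needed so that ^ is read as arithmetic, not as formula syntax.
fit.quad.age <- update(fit.lin, . ~ . + I(age^2))
fit.quad.day <- update(fit.lin, . ~ . + I(day_ill^2))

# Likelihood ratio tests of the nested models (1 degree of freedom each).
cmp.age <- anova(fit.lin, fit.quad.age, test = "Chisq")
cmp.day <- anova(fit.lin, fit.quad.day, test = "Chisq")
cmp.age
cmp.day
```

A small p-value in the second row of each table would indicate that the quadratic term improves the fit, that is, evidence against a purely linear effect on the log-odds scale.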
- The effect of hematocrit on the risk of re-shock might be different for males and females. Does the data provide any evidence for an interaction between sex and hematocrit?
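An interaction between sex and hematocrit can be examined by comparing a model with main effects only to one that adds the product term. The sketch below uses simulated data without a true interaction; the level names "female" and "male" are assumptions, and the other covariables are omitted for brevity.

```r
# Simulated data (invented values; no interaction built in).
set.seed(7)
n <- 268
df2009 <- data.frame(
  sex = factor(sample(c("female", "male"), n, replace = TRUE)),
  hct = round(rnorm(n, 48, 5), 1)
)
df2009$reshock <- rbinom(n, 1, plogis(-3 + 0.05 * df2009$hct))

# Main effects only versus main effects plus interaction.
fit.main <- glm(reshock ~ sex + hct, family = binomial, data = df2009)
fit.int  <- glm(reshock ~ sex * hct, family = binomial, data = df2009)  # sex + hct + sex:hct

# Likelihood ratio test for the interaction term.
cmp <- anova(fit.main, fit.int, test = "Chisq")
cmp

# Wald test of the same term. The coefficient is the difference in the
# log-odds slope of hct between males and females (the reference level).
coef(summary(fit.int))["sexmale:hct", ]
```

With a two-level factor and a continuous covariable the interaction has a single coefficient, so the likelihood ratio test and the Wald test address the same question and usually give similar p-values.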